基于内部语言模型估计(ILME)语言模型(LM)融合已显示出明显改善的识别结果,而识别域内和跨域语音识别任务的常规浅融合。在本文中,我们试图将ILME方法应用于跨域代码转换语音识别(CSSR)工作。具体而言,我们的好奇心来自几个方面。首先,我们很好奇基于ILME的LM融合对内域和跨域CSSR任务的有效性。我们在不合并两个代码转换域的情况下对此进行验证。更重要的是,我们通过合并两个单语言数据集训练端到端(E2E)语音识别模型,并观察到拟议的基于ILME的LM Fusion对CSSR的功效。来自东南亚和另一个中国大陆CS数据集的SEAME的实验结果证明了拟议的基于ILME的LM融合方法的有效性。
translated by 谷歌翻译
运动同步反映了相互作用二元组之间身体运动的协调。强大的深度学习模型(例如变压器网络)对运动同步的估计已自动化。但是,与其设计用于运动同步估计的专业网络,不如先前基于变压器的作品从其他任务(例如人类活动识别)中广泛采用了体系结构。因此,本文提出了一种基于骨架的图形变压器来进行运动同步估计。提出的模型应用了ST-GCN,这是一种空间图卷积神经网络,用于骨骼特征提取,然后是用于空间特征生成的空间变压器。空间变压器的指导是由相同的个体相同关节之间共享的独特设计的关节位置嵌入。此外,考虑到身体运动的周期性固有性,我们将时间相似性矩阵纳入了时间注意计算中。此外,与每个关节相关的置信度得分反映了姿势的不确定性,而先前关于运动同步估计的作品尚未充分强调这一点。由于变形金刚网络要求大量的数据进行训练,因此我们使用人类36M,一个用于人类活动识别的基准数据集构建了一个用于运动同步估算的数据集,并使用对比度学习鉴定了我们的模型。我们进一步应用知识蒸馏以减轻姿势探测器失败以隐私的方式引入的信息损失。我们将我们的方法与PT13上的代表性方法进行了比较,PT13是从自闭症治疗干预措施中收集的数据集。我们的方法达到了88.98%的总体准确性,并在保持数据隐私的同时超过了同行。
translated by 谷歌翻译
运动同步是指互动人的动作之间的动态时间联系。运动同步的应用是广泛而广泛的。例如,作为队友之间的协调量度,体育中经常报告同步分数。自闭症社区还将运动同步视为儿童社会和发展成就的关键指标。一般而言,原始视频录制通常用于运动同步估计,并且可能会揭示人们的身份。此外,这种隐私问题也阻碍了数据共享,这是自闭症研究不同方法之间公平比较的主要障碍。为了解决这个问题,本文提出了一种用于运动同步估计的合奏方法,这是在隐私保护条件下进行自动运动同步评估的首批基于深度学习的方法之一。我们的方法完全依赖于可公开共享的身份不足的二级数据,例如骨架数据和光流。我们在两个数据集上验证我们的方法:(1)从自闭症治疗干预措施中收集的PT13数据集以及(2)从同步潜水竞赛中收集的TASD-2数据集。在这种情况下,我们的方法优于其对应方法的方法,包括深层神经网络和替代方法。
translated by 谷歌翻译
通常承认,巨额(培训)数据的可用性是人工智能(AI)最近进步的最重要因素之一。但是,数据集通常用于狭窄的AI子区域中的特定任务,并且没有统一的方式来管理和访问它们。这不仅在培训或部署机器学习模型时创造了不必要的开销,但也限制了对数据的理解,这对于以数据为中心的AI非常重要。在本文中,我们向不同数据集的统一框架展示了我们的愿景,以便可以轻松地集成和查询,例如,使用标准查询语言。我们在持续的工作中展示了这一点,为计算机愿景中的数据集创建了一个框架,并在不同的场景中显示了它的优势。我们的演示可在https://vision.semkg.org中获得。
translated by 谷歌翻译
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
translated by 谷歌翻译
Blind image quality assessment (BIQA) remains challenging due to the diversity of distortion and image content variation, which complicate the distortion patterns crossing different scales and aggravate the difficulty of the regression problem for BIQA. However, existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has been done on learning strategies to make the regression model produce better performance. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT), to help the model learn complex distortion patterns and better optimize the regression issue to align with the law of human learning process from easy to hard. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets, and the experimental results indicate that the performance of PMT-IQA is superior to the comparison approaches, and both MS and PMT modules improve the model's performance.
translated by 谷歌翻译
Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, more so is evaluating drum grooves with little precedence in literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
translated by 谷歌翻译
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
translated by 谷歌翻译